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Abstract. The last decade has witnessed a rapid proliferation of su- 
perscalar cache-based microprocessors to build high-end capability and 
capacity computers primarily because of their generality, scalability, and 
cost effectiveness. However, the recent development of massively paral- 
lel vector systems is having a significant effect on the supercomputing 
landscape. In this paper, we compare the performance of the recently- 
released Cray XI vector system with that of the cacheless NEC SX-6 vec- 
tor machine, and the superscalar cache-based IBM Power3 and Power4 
architectures for scientific applications. Overall results demonstrate that 
the XI is quite promising, but performance improvements are expected 
as the hardware, systems software, and numerical libraries mature. Code 
reengineering to effectively utilize the complex architecture may also lead 
to significant efficiency enhancements. 


1 Introduction 

The last decade has witnessed a rapid proliferation of superscalar cache-based 
microprocessors to build high-end capability and capacity computers [11]. This 
is primarily because their generality', scalability, and cost effectiveness convinced 
computer vendors, buyers, and users that vector architectures hold little promise 
for future large-scale supercomputing systems. 

Two major elements have dramatically affected this common perception. The 
first is the recent availability of the Japanese Earth Simulator [3] that, based on 
NEC SX-6 vector technology, is the world’s most powerful supercomputer both 
in terms of peak (40.96 TFlops/s) and sustained LINPACK (87.5% of peak) 
performance. The second is the constant degradation in sustained performance 
(now typically less than 10% of peak) for scientific applications running on SMP 
clusters built from processors using superscalar technology. The Earth Simulator, 
on the other hand, has demonstrated sustained performance of almost 65% of 
peak for a production global atmospheric simulation [10]. 

In order to quantify what a vector capability entails for scientific communities 
that rely on modeling and simulation, it is critical to evaluate it in the context 
of demanding computational algorithms. In our previous work [8], we conducted 


an evaluation of the SX-6 for the NAS Parallel Benchmarks and a suite of six 
applications. In this paper, we compare the performance of the Cray Xl vector 
testbed system at Oak Ridge National Laboratory with that of the SX-6, and 
the superscalar cache-based IBM Power3 and Power4 architectures for scientific 
applications. Even though both the SX-6 and the Xl are vector architectures, 
the Xl is unique in that it has a data cache (the SX-6 is cacheless). 

We begin our performance evaluation by comparing memory bandwidth and 
scatter/gather hardware behavior of a triad summation using regularly and ir- 
regularly strided data. Next, we use two synthetic benchmarks: a fast Fourier 
transform and an N-body soiver to further compare the four target architec- 
tures. Finally, we present performance results for three scientific applications 
from cosmology (MADCAP), materials science (PARATEC), and fluid dynamics 
(OVERFLOW-D). For each of these benchmarks and applications, we examine 
the effort required to port them to the Xl. Both intra- and inter- node parallel 
performance (in GFlops/s per processor) is reported for our application suite 
wnen running a small and a large test case. Xl results are generally promis- 
ing; however, we expect performance improvements as the machine stabilizes 
to production mode. Note that none of the applications have been specifically 
optimized for the Xl, so appropriate reengineering to utilize the architecture 
effectively would also be beneficial. 

2 Target Architectures 

We give a brief description of the salient architectural features of the'Xl, as well 
as those of the three other machines that were examined as part of this study 
Table 1 presents a summary of their node characteristics. Observe that the Xl 
has the highest peak performance and that its memory subsystem features a data 
rate more than 50% higher than the SX-6 and an order of magnitude greater 
than the Power systems. 


Table 1 . Architectural specifications of the Xl, Power3, Power4, and SX-6 nodes 


Node 

Type 

CPU/ 

Node 

Clock 

(MHz) 

Peak 

(GFlops/s) 

Memory BW 
(GB/s) 

Peak 

Bytes/Flop 

MPI Latency 
(fj, sec) 

Xl 

4 

800 

12.8 

50 

16 

8.2 

Power3 

16 

375 

1.5 

0.7 

0.47 

8.6 

Power4 

32 

1300 

5.2 

2.3 

0.44 

3.0 

SX-6 

8 

500 

8.0 

32 

4.0 

2.1 


2.1 Cray Xl 

The recently-released Xl is designed to combine traditional vector strengths 
with the generality and scalability features of modern superscalar cache-based 
parallel systems. The computational core, called the single-streaming proces- 
sor (SSP), contains two 32-stage vector pipes running at 800 MHz. Each SSP 
contains 32 vector registers holding 64 double-precision words, and operates at 




Fig. 1 . Architecture of the XI node (in red), multi-streaming processor (in green), and 
single-streaming processor (in blue). 

3.2 GFlops/s peak for 64-bit data. All vector operations are performed under a 
bit mask, allowing loop blocks with conditionals to compute without the need for 
scatter/gather operations. Additionally, each SSP can have up to 512 addresses 
simultaneously in flight, thus hiding the latency overhead for a potentially signif- 
icant number of memory fetches. The SSP also contains a two-way out-of-order 
superscalar processor running at 400 MHz with two 16KB caches (instruction 
and data) . The scalar unit operates at a quarter of the performance of the vector 
pipes, making a high vector operation ratio critical for effectively utilizing the 
underlying hardware. Fig. 1 shows a schematic of the XI architecture. 

The multi-streaming processor (MSP) combines four SSPs into one logical 
computational unit. The four SSPs share a 2-way set associative 2MB data 
Ecache, a unique feature for vector architectures that allows extremely high 
bandwidth (25-51 GB/s) for computations with temporal data locality. An XI 
node consists of four MSPs sharing a flat memory through 16 memory con- 
trollers (MChips). Each MChip is attached to a local memory bank (MBank), 
for an aggregate of 200 GB/s node bandwidth. Additionally, MChips can be used 
to directly connect up to four nodes (16 MSPs) and participate in remote address 
translation. To build large configurations, a modified 3D torus interconnect is 
implemented via specialized routing chips. The torus topology allows scalability 
to large processor counts with relatively few links compared with the crossbar 
interconnect of the IBM Power systems or the Earth Simulator; however, the 3D 
torus suffers from limited bisection bandwidth. Finally, the XI is a globally ad- 
dressable architecture, with specialized hardware support that allows processors 
to directly read or write remote memory addresses in an efficient manner. 
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The XI programming model is designed to hierarchically leverage parallelism. 
At the SSP level, vector instructions allow a large number of SIMD operations 
(64) to execute in a pipeline fashion, thereby masking memory latency and al- 
lowing for high sustained performance. MSP parallelism is achieved by distribut- 
ing loop iterations across each of the four SSPs. The compiler must therefore 
generate both vectorizing and multistreaming instructions to effectively utilize 
the XI. Intra-node parallelism across the four MSPs is explicitly controlled us- 
ing shared-memory directives such as OpenMP or Pthreads. Finally, traditional 
message passing via MPI is used for coarse-grain parallelism at the inter-node 
level. Additionally, the hardware supported globally addressable memory allows 
efficient implementations of one-sided communication libraries (SHMEM, MPI- 
2), as well as implicitly parallel programming languages (UPC and CAF). 

All XI experiments reported in this paper were performed on the 128-MSP 
system running UNICOS/mp 2.3.07 and operated by Oak Ridge National Lab- 
oratory. 


2.2 IBM Power 3 

The Power3 experiments reported here were conducted on the 380-node IBM 
pSeries system running AIX 5.1 and located at Lawrence Berkeley National 
Laboratory. Each 375 MHz processor contains two floating-point units (FPUs) 
that can issue a multiply-add (MADD) per cycle for a peak performance of 
1.5 GFlops/s. The Power3 has a pipeline of only three cycles, diminishing the 
penalty for mispredicted branches. The out-of-order architecture uses prefetching 
to reduce pipeline stalls due to cache misses. The CPU has a 32KB instruction 
cache, a 128KB 128- way set associative LI data cache, and an 8MB four-way set 
associative L2 cache with its own private bus. Each SMP node consists of 16 pro- 
cessors connected to main memory via a crossbar. Multi-node configurations are 
networked via the Colony switch using an omega-type topology. 


2.3 IBM Power4 

The Power4 experiments were performed on the 27-node IBM pSeries 690 system 
running AIX 5.1 and operated by Oak Ridge National Laboratory. Each 32- 
way SMP consists of 16 Power4 chips (organized as 4 MCMs), where a chip 
contains two 1.3 GHz processor cores. Each core has two FPUs capable of a fused 
MADD per cycle, for a peak performance of 5.2 GFlops/s. The superscalar out- 
of-order architecture can exploit instruction level parallelism through its eight 
execution units. Advanced branch prediction hardware minimizes the effects of 
the relatively long pipeline (six cycles) necessitated by the high frequency design. 
Each processor contains its own private LI cache (64KB instruction and 32KB 
data) with prefetch hardware; however, both cores share a 1.5MB unified L2 
cache. The L3 is designed as a stand-alone 32MB cache, or to be combined with 
other L3s on the same MCM to create a larger 128MB interleaved cache. Multi- 
node configurations are currently available employing the Colony interconnect, 
but future large-scale systems will use the lower latency Federation switch. 
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2.4 NEC SX-6 


The SX-6 results in this work were obtained on the single- node system running 
SUPER-UX at the Arctic Region Supercomputing Center. The SX-6 vector pro- 
cessor uses a dramatically different architectural approach than conventional 
cache-based systems. Vectorization exploits regularities in the computational 
structure of scientific applications to expedite uniform operations on independent 
data sets. The 500 MHz SX-6 processor contains an 8-way replicated vector pipe 
capable or issuing a eacii cycle, for a peak periomiaiice of 8.0 GFlops/s 

per CPU. The processors contain 72 vector registers, each holding 256 64- bit 
words. For non-vectorizable instructions, the SX-6 contains a 500 MHz scalar 
processor with a 64KB instruction cache, a 64KB data cache, and 128 general- 
purpose registers. The 4- way superscalar unit has a peak of 1.0 GFlops/s and 
supports branch prediction, data prefetching, and out-of-order execution. 

Unlike conventional architectures, the SX-6 vector unit lacks data caches. 
It therefore masks memory latencies by overlapping pipelined vector operations 
with memory fetches. The SX-6 uses high speed SDRAM with a peak bandwidth 
of 32 GB/s per CPU; enough to feed one operand per cycle to each of the 
replicated pipe sets. The nodes can be used as building blocks of large-scale 
multi-processor systems; for instance, the Earth Simulator contains 640 SX-6 
nodes, connected through a single-stage crossbar. 

3 Microbenchmarks 

We first present the performance of a microbenchmark that measures some 
low-level machine characteristics such as memory subsystem behavior and scat- 
ter/gather hardware support. Fig. 2(a) presents asymptotic memory bandwidth 
of the triad summation: a(i) — b(i) -f s x c(i) using various memory strides for 
the four architectures in our study. The SX-6 achieves the best bandwidth: up to 
one, two, and three orders of magnitude better than the XI, Power4, and Power3, 
respectively. Observe that certain strides impact XI and SX-6 bandwidth quite 
pronouncedly, by an order of magnitude or more. Analysis shows that strides 
containing factors of two worsen performance due to increased DRAM bank 
conflicts. On the Power architectures, a precipitous drop in the transfer rate 
occurs for small strides, due to loss of cache reuse. This drop is more severe on 
the Power4, because of its more complicated cache structure. 

Fig. 2(b) presents single-processor memory bandwidth of indirect addressing 
through vector triad gather/scatter operations of various data sizes. For smaller 
sizes, the cache-based architectures show better data rates for indirect access 
to memory, but for larger data sizes, the XI and the SX-6 effectively utilize 
their gather/scatter hardware support, outperforming the cache- based systems. 
The XI performs better than the SX-6 only for data sizes larger than 900KB; 
however, asymptotic bandwidth for the XI is only marginally higher. Overall, the 
XI memory subsystem behavior is surprisingly poor given its large peak memory 
bandwidth (50 GB/s per processor) and sophisticated hardware support for non- 
unit stride data access. 
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Fig. 2. Single-processor triad performance (in MB/s) using (a) regularly strided data 
of various strides, and (b) irregularly strided data of various sizes. 


4 Synthetic Benchmarks 

We next use two synthetic benchmarks to compare and contrast the performance 
of the four target architectures. The first benchmark is FT, a fast Fourier trans- 
form (FFT) kernel, from the NAS Parallel Benchmarks (NPB) suite [7]. FFT 
is a simple evolutionary problem in which a time-dependent partial differential 
equation is transformed to a set of uncoupled ordinary differential equations in 
phase space. The bulk of the computational work consists of a set of discrete 
multi-dimensional FFTs. Each FFT accesses the same single large array multiple 
times, each time with a fixed but different stride. 

The second synthetic benchmark is N-BODY, an irregularly structured code 
that simulates the interaction of a system of particles in 3D over a number of 
time steps, using the hierarchical Barnes-Hut algorithm. There are three main 
stages: building an octree to represent the distribution of the particles; brows- 
ing the octree to calculate the forces; and updating the particle positions and 
velocities based on the computed forces. Data access is scattered, and involves 
indirect addressing and pointer chasing. N-BODY requires all-to-all all-gather 
communication and demonstrates unpredictable send/receive patterns. The MPI 
version of this benchmark is derived from the SPLASH-2 suite [12]. 

FT and N-BODY results in GFlops/s per processor are presented in Fig. 3. 
FT did not perform well on the vector machines in its original form because the 
computations used a fixed block size of 16 words. Once the code was modified 
to use a block size equal to the grid dimension, vector performance improved 
significantly due to increased vector length. Speedup from one to two processors 
on the XI and the SX-6 is poor due to the time spent improving data local- 
ity for cache-based machines (this is not performed for uniprocessor runs), but 
subsequent scalability is excellent. Power4 performance suffers dramatically for 
P = 64 because of inferior inter-node communication. In terms of sustained per- 
formance, the XI is comparable to the Power systems (an average of about 8% 
of peak) but half that of the SX-6. 
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Fig. 3. Per-processor performance (in GPlops/s) for FT (left) and N-BODY (right). 


Performance for the N-BODY benchmark (see Fig. 3, right) is completely 
different in that the scalar processors are far superior than the vector proces- 
sors. For example, the Power3 achieves 7% of peak . while the XI runs at 0.3%. 
The dominant part of the code is browsing the octree and computing the inter- 
particle forces. Browsing the octree involves significant pointer chasing, which is 
a very expensive operation on vector machines and cannot be vectorized well. 
We tried to improve vectorization by separating the octree browsing and force 
calculation phases; however, results were only negligibly better. These results 
demonstrate that although the vector architectures are equipped with scalar 
units, non- vectorized codes will perform extremely poorly on these platforms. 

5 Scientific Applications 

Three scientific applications were chosen to measure and compare the perfor- 
mance of the XI with that of the SX-6, Power3, and Power4. The applications 
are: MADCAP, a cosmology code that extracts tiny signals from extremely noisy 
observations of the Cosmic Microwave Background; PARATEC, a first principles 
materials science code that solves the Kohn-Sham equations to obtain electronic 
wavefunctions: and OVERFLOW-D, a computational fluid dynamics production 
code that solves the Navier- Stokes equations around complex aerospace config- 
urations. SX-6 results are presented for only one node ( P < 8) while those on 
Power3 (P > 16), Power 4 (P > 32), and XI (P > 4) are for multi- node runs. 
Performance is shown in GFlops/s per processor. To characterize the level of 
vectorization, we also report average vector length for the XI. All performance 
numbers were obtained with hpincount on the Power systems, ftrace on the 
SX-6, and pat on the XI. Missing results will be presented in the final paper. 

5.1 Cosmology 

From the application area of cosmology, we examined the Microwave Anisotropy 
Dataset Computational Analysis Package (MADCAP) [1]. MADCAP imple- 
ments the current optimal general algorithm for extracting the most useful cos- 
mological information from total-power observations of the Cosmic Microwave 
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Background (CMB). The CMB is a snapshot of the Universe when it first became 
electrically neutral some 400.000 years after the Big Bang. The tiny anisotropies 
in the temperature* and polarization of the CMB radiation are sensitive probes 
of early Universe cosmology, and measuring their detailed statistical properties 
has been a high priority in the field for over 30 years. MADCAP was designed 
to calculate the maximum likelihood two-point angular correlation function (or 
power spectrum) of the CMB given a noisy, pixelized sky map and its associated 
pixel-pixel noise correlation matrix. 

MADCAP recasts the extraction of a CMB power spectrum from a sky map 
into a problem in dense linear algebra, and exploits ScaLAPACK libraries for 
its efficient parallel solution. The goal is to maximize the likelihood function of 
all possible cosmologies given the data d. With minimal assumptions, this can 
be reduced to maximizing d’s (reduced to a pixelized sky map) Gaussian log- 
likelihood C(d\Cb) = ~\(d T D~ l d- Tr [lnD]) over all possible power spectrum 
n. « - 
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Using Newton- Raphson iteration to locate the peak of C(d\Cb) requires the 
evaluation of its first two derivatives with respect to C& . This involves first build- 
ing D as the sum of the experiment-specific noise correlations and theory- specific 
signal correlations, and then solving a square linear system for each of the Mb 
spectral coefficients. To minimize communication overhead, this step is com- 
puted through explicit inversion of D (symmetric positive definite) and direct 
matrix- matrix multiplication. These operations scale as Mb M£ for a map with 
M p pixels, typically 10 4 -10° for real experiments. 

To take advantage of large parallel computer systems while maintaining high 
performance, the MADCAP implementation has recently been rewritten to ex- 
ploit the parallelism inherent in performing Mb independent matrix- matrix mul- 
tiplications. The analysis is split into two steps: first, all the processors build 
and invert D; then, the processors are divided into gangs, each of which per- 
forms a subset of the multiplications (gang-parallel) . Since the matrices involved 
are block- cyclically distributed over the processors, this incurs the overhead of 
redistributing the matrices between the two steps. 


XI Porting Porting MADCAP to vector architectures is straightforward, since 
the package utilizes ScaLAPACK to perform dense linear algebra calculations. 
However, it was necessary to rewrite the dSdC routine that computes the pixel- 
pixel signal correlation matrices, scales as MbMp , and does not have any BLAS 
library calls. The basic structure of dSdC loops over all pixels and calculates 
the value of Legendre polynomials up to some preset degree for the angular 
separation between these pixels. On superscalar architectures, this constituted 
a largely insignificant amount of work, but since the computation of the poly- 
nomials is recursive, prohibiting vectorization, the cost was appreciable on both 
the XI and the SX-6. The dSdC routine was therefore rewritten so that at each 
iteration of the Legendre polynomial recursion, a large batch of angular separa- 
tions was computed in an inner loop. Compiler directives were required to ensure 
vectorization for both the vector machines. 


Table 2. Performance of MADCAP on target machines 



Power3 

Power4 

SX-6 

XI 

p 

GFlops/s % Peak 

GFlops/s 

% Peak 

GFlops/s 

% Peak 

GFlops/s 

% Peak | 

Small synthetic case: A f p ~ 8192, J\f b = 

16, without gang-parallelism 

1 

1 

0.844 

56.3 

1.98 

38.1 

4.55 

56.9 

5.36 

41.9 

4 

0.656 

43.7 

1.23 

23.7 

2.66 

33.2 

4.26 

33.3 

9 

0.701 

46.7 



— 

— 

3.22 

25.2 | 

i a 

xv 

0.520 

34.7 

O *701 
1 ^ 

14.1 

— 

— 

2.48 

IQ d 

25 

0.552 

36.8 



— 

— 

1.84 

14.4 | 

Large real case: M. p = 

= 14966, M 

= 16, with gang-parallelism 


4 

0.831 

55.4 

1.90 

36.5 

4.24 

53.0 

5.66 

44.2 

8 

0.688 

45.9 

1.62 

31.2 

3.24 

40.5 



16 

0.615 

41.0 

1.26 

24.2 

— 

— 



32 

0.582 

38.8 

0.804 

15.5 

— 

— 


j 

64 

0.495 

33.0 

0.524 

10.1 

— 

— 
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. Performance Results Table 2 first shows the performance of MADCAP for 
a small synthetic dataset with J\f p — 8192 and A & = 16 on each of the four 
architectures under investigation, without the use of gang-parallelism. Note that 
for maximum efficiency, the processor sizes are set to squares of integers. As 
expected, the single processor runs achieve a significant fraction of the peak 
performance since the analysis is dominated by BLAS3 operations (at least 60%). 
However, as we use more processors with a fixed data_size, the density of the 
data on each processor decreases, the communication overhead increases, and 
performance drops significantly Observe that the XI attains the highest raw 
performance, achieving factors of 6.5x, 3.5x, and 1.6x speedup compared with 
the Power3, Power4, and SX-6 respectively on 4 processors. With increasing 
processor counts, the fraction of time spent in BLAS3 computations decreases, 
causing performance degradation. The average vector length was about 62 per 
SSP (vector length 64) for all processor counts. It is therefore surprising that 
MADCAP did not attain a higher percentage of peak on the XI. 

Table 2 also shows the performance of the gang-parallel implementation of 
MADCAP for a larger real dataset from the Millimeter- wave Anisotropy Exper- 
iment Imaging Array (MAXIMA) [5], with Ap = 14996 and A& = 16. MAX- 
IMA is a balloon-borne experiment with an array of 16 bolometric photome- 
ters optimized to map the CMB anisotropy over hundreds of square degrees. 
The gang parallel technique is designed to preserve the density of data during 
matrix-matrix multiplication and produce close to linear speedup. In this regard, 
the results on the Power3 and Power4 are somewhat disappointing. This is be- 
cause the other steps in MADCAP do not scale that well, and the efficiency of 
each gang suffers due to increased memory contention. Currently, we cannot run 
more than one (4-processor) gang on the XI because the ScaLAPACK function 
blacs_gridjnap does not work properly. This function is required to map a set of 
MPI tasks onto several distinct BLACS contexts, one per gang. Cray engineers 
are working to address this problem. 
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5.2 Materials Science 


The application that we selected from materials science is called PARATEC: 
PARAllel Total Energy Code [9]. The code performs ab-initio quant urn- mechani- 
cal total energy calculations using pseudopotentials and a plane wave basis set. 
The pseudopotentials are of the standard norm-conserving variety. Forces can 
be easily calculated and used to relax the atoms into their equilibrium posi- 
tions. PARATEC uses an all-band conjugate gradient (CG) approach to solve 
rhe Kohn-Sham equations of Density Functional Theory (DFT) and obtain the 
ground-state electron wavefunctions. DFT is the most commonly used technique 
in materials science, having a quantum mechanical treatment of the electrons, 
to calculate the structural and electronic properties of materials. Codes based 
on DFT are widely used to study properties such as strength, cohesion, growth, 
magnetic, optical, and transport for materials like nanostructures, complex sur- 
faces, and doped semiconductors. 

In solving the Kohn-Sham equations using a plane wave basis, part of the 
calculation is carried out in real space and the remainder in Fourier space using 
specialized parallel 3D FFTs to transform the wavefunctions. The code spends 
most of its time (over 70% for a large system) in vendor supplied BLAS3 and 
ID FFTs on which the 3D FFTs are built. For this reason, PARATEC generally 
obtains a high percentage of peak performance on different platforms. The code 
exploits fine-grained parallelism by dividing the plane wave components for each 
electron among the different processors [4]. 


XI Porting PARATEC is an MPI package designed primarily for massively 
parallel computing platforms, but can also run on serial machines. Since much 
of the computation involves FFTs and BLAS3, an efficient vector implementa- 
tion of the code requires these libraries to vectorize well. While this is true for 
the BLAS3 routines on the XI, the standard FFTs run at a low percentage of 
peak. Some code transformations were therefore required to convert the 3D FFT 
routines to simultaneous (“multiple”) ID FFT calls. Compiler directives were 
also inserted in certain code segments to force vectorization on the XI in loops 
where indirect indexing is used. 


Performance Results The results in Table 3 are for 3 CG steps of 250 and 
432 Si-atom bulk systems and a standard LDA run of PARATEC with a 25 
Ry cut-off using norm- conserving pseudopotentials. A typical calculation would 
require between 20 and 60 CG iterations to converge the charge density. 

Results show that PARATEC vectorizes well on the SX-6 for the small test 
case. The XI is significantly slower, even though it has a higher peak speed. One 
reason is that, on the XI, the code spends a much smaller percentage of the 
total time in highly optimized 3D FFTs and BLAS3 libraries than any of the 
other machines. This lowers sustained performance as other parts of the code 
run well below peak. Some of the loss in XI scaling for P > 4 is also due to inter- 
node communication and shorter vector lengths. The parallel 3D FFTs require 
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Table 3. Performance of PARATEC on target machines 


p 

Power3 

Power4 

SX-6 

XI 

GFlops/s % Peak 

GFlops/s 

% Peak 

| GFlops/s 

% Peak 

GFlops/s 

% Peak 

| Small case: 250 Si-atom bulk system 

1 

0.915 

61.0 

2.29 

44.0 

5.09 

63.6 

2.97 

23.2 

2 

0.915 

61.0 

2.25 

43.3 

4.98 

62.3 

2.82 

22.0 

4 

0.920 

61.3 

2.21 

42.5 

4.70 

58.8 

2.65 

20.7 

8 

0.911 

60.7 

2.09 

40.2 

4.22 

52.8 

2.48 

19.4 

16 

0.840 

56.0 

1.57 

30.2 

— 

— 

1.99 

15-5 

32 

0.725 

48.3 

1.33 

25.6 

— 

— 

1.67 

13.0 

| Large case: 

432 Si-atom bulk system 





32 

0.959 

63.9 

1.49 

28.7 

— 

— 

2.55 

19.9 

64 

0.736 

49.1 



1 — 

— 

1.97 

15.4 

96 

0.523 

34.9 



— 

— 

1.49 

11.6 

128 

0.516 

34.4 



— 

— 

1.25 

9.8 


a transformation of the distributed grid which results in global communication. 
The average vector length drops from 49 for the uni-processor run to 44 on 16 
processors. Performance for the large test case is poor for P = 128, primarily 
due to low average vector length (35) and increased communication overhead. 
PARATEC runs efficiently on the Power3, but performance on the Power4 is 
affected by the relatively poor ratio of memory bandwidth to peak performance. 
Nonetheless, the Power systems obtain a much higher percentage of peak than 
the XI for all runs. Faster FFT libraries and some code rewriting to increase 
vectorization in conjunction with compiler directives could significantly improve 
the performance of PARATEC on the XI. 

5.3 Fluid Dynamics 

In the area of computational fluid dynamics (CFD), we selected the NASA 
Navier- Stokes production application called OVERFLOWED [6]. The code uses 
the overset grid methodology [2] to perform high-fidelity viscous simulations 
around realistic aerospace configurations. It is popular within the aerodynamics 
community due to its ability to handle complex designs with multiple geomet- 
ric components. OVERFLOW- D is explicitly designed to simplify the modeling 
of problems when components are in relative motion. The main computational 
logic at the top level of the sequential code consists of a time- loop and a nested 
grid-loop. Within the grid- loop, solutions to the flow equations are obtained 
on the individual grids with imposed boundary conditions. Overlapping bound- 
ary points or inter-grid data are updated from the previous time step using a 
Chimera interpolation procedure. Upon completion of the grid-loop, the solution 
is automatically advanced to the next time step by the time-loop. The code uses 
finite differences in space, with a variety of implicit /explicit time stepping. 

The MPI version of OVERFLOW-D takes advantage of the overset grid sys- 
tem, which offers a natural coarse- grain parallelism. A bin- packing algorithm 
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clusters individual grids into groups, each of which is then assigned to an MPI 
process. The grouping strategy uses a connectivity test that inspects for an over- 
lap between a pair of grids before assigning them to the same group, regardless of 
the size of the boundary data or their connectivity to other grids. The grid-loop 
in the parallel implementation is subdivided into two procedures: a group-loop 
over groups, and a grid-loop over the grids within each group. Since each MPI 
process is assigned to only one group, the group-loop is executed in parallel, 
with each group performing its own sequential grid-loop. The inter-grid bound- 
ary updates within each group are performed as in the serial case. Inter-group 
boundary exchanges are achieved via MPI asynchronous communication calls. 

XI Porting The MPI implementation of OVERFLOW-D is based on the se- 
quential version that was designed to exploit early Cray vector machines. The 
same basic program structure is used on all four target architectures except for 
a few minor changes in some subroutines to meet specific compiler requirements. 
In porting the code to the XI, three binary input files also had to be converted 
from the default IEEE format to 64-bit (integer*8 and real*8) data types. 
Furthermore, for data consistency between FORTRAN and C, the int data in 
all C functions were converted to long int. The main loops in compute-intensive 
sections of the code were multi-streamed and vectorized automatically by the 
compiler; however, modifications were made to allow vectorization of certain 
other loops via pragma directives. 


Performance Results The results in Table 4 are for 100 time steps of an 
OVERFLOW-D simulation of vortex dynamics in the complex wake flow region 
around hovering rotors. A typical calculation would require several thousand 
time steps. Note that the current MPI implementation of OVERFLOW-D does 
not allow uni-processor runs. The grid system for the small test case consists of 
41 blocks and about 8 million grid points, while that for the large case has 857 
blocks and 69 million grid points. The Cartesian off-body wake grids surround 
the curvilinear near-body grids with uniform resolution, but become gradually 
coarser upon approaching the outer boundary of the computational domain. 

Results for the small case demonstrate that the XI is about a factor of 2x 
slower than the SX-6 and 1.5x (4x) faster than the Power 4 (Power3), based on 
GFiops/s per processor. The SX-6 outperforms the other machines; e.g. its ex- 
ecution time for P = 8 (1.6 secs) is 90%, 50%, and 21% of the P = 16 timings 
for the XI, Power4, and Power3. The. SX-6 also consistently achieves the highest 
percentage of peak performance while the XI is the lowest. Scalability is similar 
for all mac hin es except the Power4, with computational efficiency decreasing for 
a larger number of MPI tasks primarily due to load imbalance. Performance 
is particularly poor for P = 32 when 41 zones have to be distributed to 32 
processors. On the XI, a relatively small average vector length of 24 per SSP 
explains why the code achieves an aggregate of only 3.6 GFlops/s on 8 pro- 
cessors (3.5% of peak). Although the same general trends hold for the large 
case, performance is slightly better (for the same number of processors) because 
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Table 4. Performance of OVERFLOW-D on target machines 


p 

Power3 

Power4 

SX-6 

XI 

GFlops/s % Peak 

GFlops/s % Peak 

GFlops/s 

% Peak 

GFlops/s % Peak 

Small case: 8 million grid points 

2 

0.133 

8.9 

0.394 

7.6 

1.13 

14.1 

0.513 

4.0 

4 

0.117 

7.8 

0-366 

7.0 

1.11 

13.9 

0.479 

3.7 

8 

0.118 

7.9 

0.362 

7.0 

0.97 

12.1 

0.445 

3.5 

! ic 

0.097 

6.5 

0.210 

4.0 

— 

— 

n 433 

3.4 ! 

iLki 

0.087 

5.8 

0.115 

2.2 

— 

— 

0.278 

2.2 1 

Large case: 69 million 

grid points 






16 

0.115 

7.7 

0.282 

5.4 

— 

— 

0.456 

3.6 

32 

0.109 

7.3 

0.246 

4.7 

— 

— 

0.413 

3.2 

64 

0.105 

7.0 

0.203 

3.9 

— 

— 

0.390 

3.0 

96 

0.100 

6.7 

0.176 

3.4 

— 

— 

0.360 

2.8 

128 

0.094 

6.3 

0.146 

2.8 

— 

— 

0.319 

2.5 | 


of higher computation-to-communication ratio and better load balance. Reor- 
ganizing OVERFLOW-D would achieve improve vector performance; however, 
extensive effort would be required to modify this production code. 

6 Summary and Conclusions 

This paper presented the performance of the Cray XI vector testbed system, and 
compared it against the cacheless NEC SX-6 vector machine and the superscalar 
cache-based IBM Power3/4 architectures, across a range of scientific kernels and 
applications. Microbenchmarking showed that the XI memory subsystem be- 
haved rather poorly, given its large peak memory bandwidth and sophisticated 
hardware support for non-unit stride data access. Table 5 summarizes the perfor- 
mance of the XI against the three other machines, for our synthetic benchmarks 
and application suite. All our codes, except N-BODY, were readily amenable to 
vectorization via minor code changes, compiler directives, and the use of ven- 
dor optimized numerical libraries. Results show that the XI achieves high raw 
performance relative to the Power systems for the computationally intensive ap- 
plications; however, the SX-6 demonstrated faster runtimes in all cases except for 
MADCAP. Surprisingly, the XI did not achieve a high fraction of peak perfor- 
mance compared with the other platforms, even though reasonably large vector 
lengths were attained for the applications. The N-BODY kernel is an irregularly 
structured code poorly suited for vectorization, and highlights the extremely low 
performance one should expect when running scalar codes on this architecture. 

Overall, the XI performance is quite promising, and improvements are ex- 
pected as the hardware, systems software, and numerical libraries mature. It is 
important to note that Xl-specific code optimizations have not been performed 
at this time. This complex vector architecture contains both data caches and 
multi-streaming processing units, and the optimal programming methodology is 
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Table 5. Summary' overview of Xl performance 


Application 

Name 

Scientific 

Discipline 

Lines 
of Code 

XI Speedup vs. 

P Power 3 Power4 SX-6 

P 

Power3 Power4 

FT 

Kernel 

2,000 

8 

10.0 

2.9 

0.9 

64 

11.1 

12.7 

N-BODY 

Kernel 

1,500 

8 

0.4 

0.2 

0.7 

64 

0.3 

0.1 

MADCAP 

Cosmology 

5,000 

4 

6.5 

3.5 

1.6 




PARATEC 

Mat. Sc. 

50,000 

8 

2.7 

1.2 

0.6 

128 

2.4 


OVERFLOW-D 

CFD 

100,000 

8 

3.8 

1.2 

0.5 

128 

3.4 

2.2 


yet to be established. For example, increasing cache locality through data block- 
ing will lower the memory access overhead; however, this strategy may reduce 
the vector length and cause performance degradation. Examining these tradeoffs 
will be the focus of our future work. 
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